Add ExpressionAnalyzer for pluggable expression-level statistics estimation#21122
asolimando wants to merge 16 commits into apache:main
@2010YOUY01: FYI I took a final pass on the PR and marked it as "reviewable"
Thanks to you @kosiew for the spot-on review, and for sharing your feedback. I have addressed the requested changes, happy to iterate further if needed!
Thanks for the follow-up here. The issues called out in the earlier review look addressed, including the projection ordering fix, the equality and inequality selectivity registry lookup, and narrowing the injective arithmetic NDV rule. I did notice one remaining gap around registry propagation through planner and optimizer-created projections.
kosiew left a comment
@asolimando, I spotted an issue with NDV handling that can affect selectivity estimates depending on operand order.
…propagation (#21483)

## Which issue does this PR close?

- Part of #21443 (Pluggable operator-level statistics propagation)
- Part of #8227 (statistics improvements epic)

## Rationale for this change

DataFusion's built-in statistics propagation has no extension point: downstream projects cannot inject external catalog stats, override built-in estimation, or plug in custom strategies without forking. This PR introduces `StatisticsRegistry`, a pluggable chain-of-responsibility for operator-level statistics following the same pattern as `RelationPlanner` for SQL parsing and `ExpressionAnalyzer` (#21120) for expression-level stats. See #21443 for full motivation and design context.

## What changes are included in this PR?

1. Framework (`operator_statistics/mod.rs`): `StatisticsProvider` trait, `StatisticsRegistry` (chain-of-responsibility), `ExtendedStatistics` (Statistics + type-erased extension map), `DefaultStatisticsProvider`. `PhysicalOptimizerContext` trait with `optimize_with_context` dispatch. `SessionState` integration.
2. Built-in providers for Filter, Projection, Passthrough (sort/repartition/etc), Aggregate, Join (hash/sort-merge/nested-loop/cross), Limit, and Union. NDV utilities: `num_distinct_vals`, `ndv_after_selectivity`.
3. `ClosureStatisticsProvider`: closure-based provider for test injection and cardinality feedback.
4. JoinSelection integration: `use_statistics_registry` config flag (default false), registry-aware `optimize_with_context`, SLT test demonstrating plan difference on skewed data.

## Are these changes tested?

- 39 unit tests covering all providers, NDV utilities, chain priority, and edge cases (Inexact precision, Absent propagation, Partial aggregate delegation, GROUPING SETS delegation, join-type bounds, multi-key NDV, exact Cartesian product, CrossJoin, GlobalLimit skip+fetch)
- 1 SLT test (`statistics_registry.slt`): three-table join on skewed data (8:1:1 customer_id distribution) where the built-in NDV formula estimates 33 rows (wrong; actual=66) and the registry conservatively estimates 100, producing the correct build-side swap

## Are there any user-facing changes?

New public API (purely additive, non-breaking):

- `StatisticsProvider` trait and `StatisticsRegistry` in `datafusion-physical-plan`
- `ExtendedStatistics`, `StatisticsResult` types; built-in provider structs; `num_distinct_vals`, `ndv_after_selectivity` utilities
- `PhysicalOptimizerContext` trait and `ConfigOnlyContext` in `datafusion-physical-optimizer`
- `SessionState::statistics_registry()`, `SessionStateBuilder::with_statistics_registry()`
- Config: `datafusion.optimizer.use_statistics_registry` (default false)

Default behavior is unchanged. The registry is only consulted when the flag is explicitly enabled.

Known limitations:

- Column-level stats (NDV, min/max) at Join/Aggregate/Union/Limit boundaries are not improved: these operators call `partition_statistics(None)` internally, re-fetching raw child stats and discarding registry enrichment. 4 TODO comments mark the affected call sites; #20184 would close this gap.
- No `ExpressionAnalyzer` integration yet (#21122).

---

Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.
@kosiew the CI error seems unrelated, the same test suite passed for EDIT: confirmed it was a flaky test, fixed by #21657, force-pushing to resolve conflicts with main, no real changes w.r.t. the updated status I gave above, just a mechanical fix due to
kosiew left a comment
Thanks for the updates here, a lot of the earlier concerns have been addressed nicely. The projection ordering fixes and the tighter NDV handling look solid. I took another pass and things are generally in good shape, but there are still a couple of edge cases and behavioral inconsistencies worth tightening up before landing.
Introduce ExpressionAnalyzer, a chain-of-responsibility framework for expression-level statistics estimation (NDV, selectivity, min/max).

Framework:
- ExpressionAnalyzer trait with registry parameter for chain delegation
- ExpressionAnalyzerRegistry to chain analyzers (first Computed wins)
- DefaultExpressionAnalyzer: Selinger-style estimation for columns, literals, binary expressions, NOT, boolean predicates

Integration:
- ExpressionAnalyzerRegistry stored in SessionState, initialized once
- ProjectionExprs stores optional registry (non-breaking, no signature changes to project_statistics)
- ProjectionExec sets registry via Projector, injected by planner
- FilterExec uses registry for selectivity when interval analysis cannot handle the predicate
- Custom nodes get builtin analyzer as fallback when registry is absent
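The "first Computed wins" lookup described in this commit can be pictured with a small standalone sketch. It is not the PR's code: `Analysis`, `SelectivityAnalyzer`, `FixedUdfAnalyzer`, and `AnalyzerChain` are hypothetical names, and expressions are reduced to a UDF name to keep the example self-contained (the real trait also receives the registry for chain delegation, which is omitted here).

```rust
/// Result of asking a single analyzer for an estimate (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Analysis {
    /// The analyzer produced a selectivity estimate.
    Computed(f64),
    /// The analyzer has no opinion; the chain should try the next one.
    Delegate,
}

/// Hypothetical analyzer interface: expressions are reduced to a UDF name.
trait SelectivityAnalyzer {
    fn selectivity(&self, udf_name: &str) -> Analysis;
}

/// Answers only for one specific UDF and delegates for everything else.
struct FixedUdfAnalyzer {
    udf_name: &'static str,
    selectivity: f64,
}

impl SelectivityAnalyzer for FixedUdfAnalyzer {
    fn selectivity(&self, udf_name: &str) -> Analysis {
        if udf_name == self.udf_name {
            Analysis::Computed(self.selectivity)
        } else {
            Analysis::Delegate
        }
    }
}

/// Ordered chain of analyzers: the first `Computed` answer wins.
struct AnalyzerChain {
    analyzers: Vec<Box<dyn SelectivityAnalyzer>>,
}

impl AnalyzerChain {
    /// Returns `None` when every analyzer delegates.
    fn selectivity(&self, udf_name: &str) -> Option<f64> {
        self.analyzers.iter().find_map(|a| match a.selectivity(udf_name) {
            Analysis::Computed(s) => Some(s),
            Analysis::Delegate => None,
        })
    }
}

/// Builds a chain where two analyzers compete for `is_premium`.
fn example_chain() -> AnalyzerChain {
    AnalyzerChain {
        analyzers: vec![
            Box::new(FixedUdfAnalyzer { udf_name: "is_premium", selectivity: 0.05 }),
            Box::new(FixedUdfAnalyzer { udf_name: "is_premium", selectivity: 0.9 }),
            Box::new(FixedUdfAnalyzer { udf_name: "is_active", selectivity: 0.5 }),
        ],
    }
}
```

With `example_chain()`, a lookup for `is_premium` is answered by the first analyzer and the second is never consulted, while an unknown UDF yields `None`, leaving the caller free to apply its own fallback.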
- Regenerate configs.md for new enable_expression_analyzer option
- Add enable_expression_analyzer to information_schema.slt expected output
- Fix unresolved doc links to SessionState and DefaultExpressionAnalyzer (cross-crate references use backticks instead of doc links)
- Simplify config description
…putation

- Fix expression_analyzer_registry doc comment misplaced between function_factory's doc comment and field declaration
- Fix module doc example import path (physical_plan -> physical_expr)
- Extract expression_analyzer_registry() helper in planner to avoid repeating the config check 4 times
- Defer left_sel/right_sel computation to AND/OR arms only, avoiding unnecessary sub-expression selectivity estimation for comparison operators
…ptimizer loop Add trait methods on ExecutionPlan for expression-level statistics injection (uses_expression_level_statistics, with_expression_analyzer_registry, expression_analyzer_registry). The physical planner injects the registry after plan creation and re-injects after each optimizer rule that modifies the plan, gated by the use_expression_analyzer config flag.
…ty for OR predicates OR predicates are inherently outside interval arithmetic (a union of two disjoint intervals cannot be represented as a single interval). This test confirms that ExpressionAnalyzerRegistry computes the correct inclusion-exclusion selectivity (0.28 = 0.1 + 0.2 - 0.02) on a 1000-row input, versus the default 20% (200 rows) without a registry.
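The arithmetic in this commit message can be checked with a short sketch. It assumes the two disjuncts are independent (so the overlap is the product of their selectivities, matching the 0.02 term above) and that row counts are rounded to the nearest integer; the function names and the rounding choice are illustrative assumptions, not taken from the PR.

```rust
/// Fallback selectivity used when no estimate is available (20%, as stated above).
const DEFAULT_FILTER_SELECTIVITY: f64 = 0.2;

/// Inclusion-exclusion for `a OR b`, assuming the two predicates are independent:
/// P(a or b) = P(a) + P(b) - P(a) * P(b).
fn or_selectivity(left: f64, right: f64) -> f64 {
    (left + right - left * right).clamp(0.0, 1.0)
}

/// Estimated output rows for a filter; rounding to nearest is an assumption
/// made for this sketch.
fn estimated_rows(input_rows: u64, selectivity: f64) -> u64 {
    (input_rows as f64 * selectivity).round() as u64
}

/// Compares the registry-style estimate with the default fallback for a
/// 1000-row input and disjunct selectivities of 0.1 and 0.2.
fn compare_estimates() -> (u64, u64) {
    let with_registry = estimated_rows(1000, or_selectivity(0.1, 0.2));
    let without_registry = estimated_rows(1000, DEFAULT_FILTER_SELECTIVITY);
    (with_registry, without_registry)
}
```

Under these assumptions `compare_estimates()` yields 280 rows with the inclusion-exclusion estimate against 200 rows with the default fallback.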
…ailable Return Delegate for all leaf predicates when NDV is unavailable, and propagate Delegate upward through AND/OR/NOT when any child has no estimate. DefaultExpressionAnalyzer now only produces a result when it has a genuine information advantage (NDV from column statistics).
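A minimal sketch of the propagation rule described above, using hypothetical `Estimate` and `Predicate` types rather than the PR's own. It assumes the textbook Selinger-style formulas: equality selectivity of 1/NDV, AND as a product under independence, OR via inclusion-exclusion, and NOT as the complement. Whether the PR uses exactly these formulas for every operator is an assumption.

```rust
/// Hypothetical result type: either an estimate or "no opinion".
#[derive(Debug, Clone, Copy, PartialEq)]
enum Estimate {
    Computed(f64),
    Delegate,
}

/// Hypothetical predicate tree; a leaf is `col = literal` with an optional NDV.
enum Predicate {
    Eq { ndv: Option<u64> },
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
    Not(Box<Predicate>),
}

/// Estimates selectivity, delegating whenever any leaf lacks NDV.
fn estimate(p: &Predicate) -> Estimate {
    use Estimate::*;
    match p {
        // Leaf with known, non-zero NDV: assume uniform distribution, 1/NDV.
        Predicate::Eq { ndv: Some(n) } if *n > 0 => Computed(1.0 / *n as f64),
        // Leaf without usable NDV: no information advantage, so delegate.
        Predicate::Eq { .. } => Delegate,
        // AND: product under independence; delegate if either side delegates.
        Predicate::And(l, r) => match (estimate(l), estimate(r)) {
            (Computed(a), Computed(b)) => Computed(a * b),
            _ => Delegate,
        },
        // OR: inclusion-exclusion; delegate if either side delegates.
        Predicate::Or(l, r) => match (estimate(l), estimate(r)) {
            (Computed(a), Computed(b)) => Computed(a + b - a * b),
            _ => Delegate,
        },
        // NOT: complement; delegation passes straight through.
        Predicate::Not(inner) => match estimate(inner) {
            Computed(s) => Computed(1.0 - s),
            Delegate => Delegate,
        },
    }
}
```

The design point is that a single leaf without NDV makes the whole predicate delegate, so a partially guessed number is never presented as a computed estimate.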
… OR selectivity Add a reusable StatisticsTable (TableProvider + ExecutionPlan with user-supplied statistics) to the sqllogictest harness, and use it in expression_analyzer.slt
…e no-op rules gap
…ct column stats when no ExpressionAnalyzer
kosiew left a comment
This looks good to me.
Looking forward to approving this after merge conflict is resolved.
Thanks @kosiew, I have resolved the conflicts and force-pushed again, mechanical fixes only.
xudong963 left a comment
Thanks @asolimando!
The algorithmic contribution is genuinely useful, and thanks for your patience in keeping at it.
But the registry-injection plumbing adds three permanent ExecutionPlan trait methods and five struct fields to work around a missing parameter that #20184 is poised to add properly.
Before landing, I'd want alignment on whether to (a) wait for #20184 and land this on top of a StatisticsContext parameter, or (b) accept the injection design as a permanent API surface.
Thanks @xudong963 for your thoughtful feedback. The injection design was meant as a stepping stone, not a permanent API surface, and I totally agree that adding the Re. (b), I was under the impression that #20184 might land pretty soon, so I was counting on rebasing before finalizing or shortly after, since breaking changes within the same unreleased version are generally tolerated. If the timeline to merge #20184 is uncertain, though, (b) might not be ideal. The downside of (a) is the risk of conflicts that require force-pushing, which makes it harder for reviewers to check changes incrementally. In case it makes your decision easier, I can commit to helping with #20184's implementation under either scenario if the current assignee is busy. WDYT?
Yes, please. If the current assignee is busy, taking that on is probably the fastest path to unblocking this PR in its final form. Happy to review on that side too. Also cc @jonathanc-n. I'd also like to hear suggestions from @alamb about the next step!
Hey @xudong963, I have opened #21815 to close #20184 as requested (I will ping you there too) |
## Which issue does this PR close?

- Part of #21120 (framework + projection/filter integration)

## Rationale for this change

DataFusion currently loses expression-level statistics when computing plan metadata. Projected expressions that aren't bare columns or literals get unknown statistics, and filter selectivity falls back to a hardcoded 20% when interval analysis cannot handle the predicate (e.g. OR predicates, which are not expressible as a single interval). There is also no extension point for users to provide statistics for their own UDFs.

This PR introduces `ExpressionAnalyzer`, a pluggable chain-of-responsibility framework that addresses these gaps. It follows the same extensibility pattern used elsewhere in DataFusion (`ExprPlanner`, `OptimizerRule`, `StatisticsRegistry`).

## What changes are included in this PR?

- `ExpressionAnalyzer` trait and `ExpressionAnalyzerRegistry` (chain-of-responsibility, first `Computed` wins)
- `DefaultExpressionAnalyzer` with Selinger-style estimation: equality/inequality via NDV, AND/OR via inclusion-exclusion, injective arithmetic (+/-), literals, NOT
- `ProjectionExprs` and `FilterExec` use the registry for expression-level statistics
- `ExecutionPlan` (`uses_expression_level_statistics`, `with_expression_analyzer_registry`, `expression_analyzer_registry`) for injection, overridden by `FilterExec`, `ProjectionExec`, `AggregateExec`, `HashJoinExec`, and `SortMergeJoinExec`
- `AggregateStatisticsProvider` and `JoinStatisticsProvider` (feat: Add pluggable StatisticsRegistry for operator-level statistics propagation #21483) consume the registry via the trait getter
- `optimizer.use_expression_analyzer` (default false), zero overhead when disabled

## Are these changes tested?

## Are there any user-facing changes?

New public API (purely additive, non-breaking):

- `ExpressionAnalyzer` trait and `ExpressionAnalyzerRegistry` in `datafusion-physical-expr`
- `SessionState::expression_analyzer_registry()` getter
- `SessionStateBuilder::with_expression_analyzer_registry()` setter
- `ExecutionPlan`: `uses_expression_level_statistics()`, `with_expression_analyzer_registry()`, `expression_analyzer_registry()`
- `datafusion.optimizer.use_expression_analyzer`

No breaking changes. Default behavior is unchanged (config defaults to false).
Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.